{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Report 1 - 交通事故理赔审核预测"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 杨建 2019300298"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 一、问题背景"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在交通摩擦（事故）发生后，理赔员会前往现场勘察、采集信息，这些信息往往影响着车主是否能够得到保险公司的理赔。训练集数据包括理赔人员在现场对该事故方采集的36条信息，信息已经被编码，以及该事故方最终是否获得理赔。我们的任务是根据这36条信息预测该事故方没有被理赔的概率。\n",
    "\n",
    "## #数据介绍：训练集中共有200000条样本，预测集中有80000条样本。\n",
    "![](./data_description.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 二、报告目的"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "利用所给的训练集中的不同问题，进行二元分类预测，得到测试集的评估估结果，与简单提交样本一样的结果。在找到真实结果进行准确性比对。评估自己的分类方法进行改进。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 三、实验原理"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "先观察数据发现，数据的体量特别大，训练集有二十万条数据而测试集有八万组数据。所以必须采用计算速度快的分类方法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可利用线性回归的方法快速计算\n",
    "### 1.Logistic回归算法\n",
    "Logistic 回归通过使用其固有的 logistic 函数估计概率，来衡量因变量（我们想要预测的标签）与一个或多个自变量（特征）之间的关系。\n",
    "下面的图片说明了 logistic 回归得出预测所需的所有步骤。\n",
    "![原理图](./原理图.png)  ![sigmiod函数](./原理图1.jpg)\n",
    "\n",
    "我们希望随机数据点被正确分类的概率最大化，这就是最大似然估计。最大似然估计是统计模型中估计参数的通用方法。\n",
    "### 2.梯度下降分类\n",
    "我们可以仿造线性回归的思路构造如下代价函数进行梯度下降法计算：\n",
    "$$J(\\theta)=\\sum\\frac{1}{2}(h_\\theta-x)^2$$\n",
    "梯度下降可能得到局部最优，但在优化问题里我们已经证明线性回归只有一个最优点，因为损失函数J(θ)是一个二次的凸函数，不会产生局部最优的情况。\n",
    "### 3.随机森林算法原理在报告2中有说明"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 四、实验过程"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 导入所需库"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import sklearn\n",
    "from sklearn import model_selection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 导入数据集并进行分析"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "train = pd.read_csv(\"train.csv\")\n",
    "test = pd.read_csv(\"test.csv\")\n",
    "submit = pd.read_csv(\"sample_submit.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(200000, 38)\n",
      "(80000, 37)\n",
      "(80000, 2)\n"
     ]
    }
   ],
   "source": [
    "print(train.shape)\n",
    "print(test.shape)\n",
    "print(submit.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CaseId</th>\n",
       "      <th>Q1</th>\n",
       "      <th>Q2</th>\n",
       "      <th>Q3</th>\n",
       "      <th>Q4</th>\n",
       "      <th>Q5</th>\n",
       "      <th>Q6</th>\n",
       "      <th>Q7</th>\n",
       "      <th>Q8</th>\n",
       "      <th>Q9</th>\n",
       "      <th>...</th>\n",
       "      <th>Q28</th>\n",
       "      <th>Q29</th>\n",
       "      <th>Q30</th>\n",
       "      <th>Q31</th>\n",
       "      <th>Q32</th>\n",
       "      <th>Q33</th>\n",
       "      <th>Q34</th>\n",
       "      <th>Q35</th>\n",
       "      <th>Q36</th>\n",
       "      <th>Evaluation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 38 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   CaseId  Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  ...  Q28  Q29  Q30  Q31  Q32  \\\n",
       "0       1   0   0   0   0   0   0   0   0   0  ...    0    0    0    0    0   \n",
       "1       2   0   0   0   0   0   0   0   0   0  ...    0    1    1    1    1   \n",
       "2       3   0   0   0   0   0   0   0   1   0  ...    1    2    2    2    1   \n",
       "3       4   0   0   0   0   0   0   0   0   0  ...    1    3    2    3    1   \n",
       "4       5   0   0   0   0   0   0   0   0   0  ...    1    4    2    4    1   \n",
       "\n",
       "   Q33  Q34  Q35  Q36  Evaluation  \n",
       "0    0    0    0    0           0  \n",
       "1    0    0    0    0           0  \n",
       "2    0    0    0    0           0  \n",
       "3    0    0    1    1           0  \n",
       "4    0    0    1    1           0  \n",
       "\n",
       "[5 rows x 38 columns]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 训练数据由训练编号、采集到的36条信息，和审核结果三部分组成"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CaseId</th>\n",
       "      <th>Q1</th>\n",
       "      <th>Q2</th>\n",
       "      <th>Q3</th>\n",
       "      <th>Q4</th>\n",
       "      <th>Q5</th>\n",
       "      <th>Q6</th>\n",
       "      <th>Q7</th>\n",
       "      <th>Q8</th>\n",
       "      <th>Q9</th>\n",
       "      <th>...</th>\n",
       "      <th>Q28</th>\n",
       "      <th>Q29</th>\n",
       "      <th>Q30</th>\n",
       "      <th>Q31</th>\n",
       "      <th>Q32</th>\n",
       "      <th>Q33</th>\n",
       "      <th>Q34</th>\n",
       "      <th>Q35</th>\n",
       "      <th>Q36</th>\n",
       "      <th>Evaluation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>count</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>200000.00000</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>200000.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>mean</td>\n",
       "      <td>100000.500000</td>\n",
       "      <td>0.016445</td>\n",
       "      <td>0.010875</td>\n",
       "      <td>0.024925</td>\n",
       "      <td>0.038560</td>\n",
       "      <td>0.039130</td>\n",
       "      <td>0.080335</td>\n",
       "      <td>0.113725</td>\n",
       "      <td>0.294165</td>\n",
       "      <td>0.038810</td>\n",
       "      <td>...</td>\n",
       "      <td>0.611760</td>\n",
       "      <td>4.034760</td>\n",
       "      <td>1.163025</td>\n",
       "      <td>9.428990</td>\n",
       "      <td>0.979740</td>\n",
       "      <td>0.032945</td>\n",
       "      <td>0.005520</td>\n",
       "      <td>0.64946</td>\n",
       "      <td>0.657345</td>\n",
       "      <td>0.158005</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>std</td>\n",
       "      <td>57735.171256</td>\n",
       "      <td>0.182496</td>\n",
       "      <td>0.141269</td>\n",
       "      <td>0.213645</td>\n",
       "      <td>0.263483</td>\n",
       "      <td>0.301528</td>\n",
       "      <td>0.508923</td>\n",
       "      <td>0.589884</td>\n",
       "      <td>0.916137</td>\n",
       "      <td>0.267851</td>\n",
       "      <td>...</td>\n",
       "      <td>0.984868</td>\n",
       "      <td>3.815101</td>\n",
       "      <td>0.616417</td>\n",
       "      <td>9.004402</td>\n",
       "      <td>0.238725</td>\n",
       "      <td>0.802953</td>\n",
       "      <td>0.074092</td>\n",
       "      <td>0.47714</td>\n",
       "      <td>0.474598</td>\n",
       "      <td>0.364747</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>min</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>25%</td>\n",
       "      <td>50000.750000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>50%</td>\n",
       "      <td>100000.500000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.00000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>75%</td>\n",
       "      <td>150000.250000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>14.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.00000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>max</td>\n",
       "      <td>200000.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>...</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>52.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>42.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.00000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>8 rows × 38 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "              CaseId             Q1             Q2             Q3  \\\n",
       "count  200000.000000  200000.000000  200000.000000  200000.000000   \n",
       "mean   100000.500000       0.016445       0.010875       0.024925   \n",
       "std     57735.171256       0.182496       0.141269       0.213645   \n",
       "min         1.000000       0.000000       0.000000       0.000000   \n",
       "25%     50000.750000       0.000000       0.000000       0.000000   \n",
       "50%    100000.500000       0.000000       0.000000       0.000000   \n",
       "75%    150000.250000       0.000000       0.000000       0.000000   \n",
       "max    200000.000000       3.000000       3.000000       3.000000   \n",
       "\n",
       "                  Q4             Q5             Q6             Q7  \\\n",
       "count  200000.000000  200000.000000  200000.000000  200000.000000   \n",
       "mean        0.038560       0.039130       0.080335       0.113725   \n",
       "std         0.263483       0.301528       0.508923       0.589884   \n",
       "min         0.000000       0.000000       0.000000       0.000000   \n",
       "25%         0.000000       0.000000       0.000000       0.000000   \n",
       "50%         0.000000       0.000000       0.000000       0.000000   \n",
       "75%         0.000000       0.000000       0.000000       0.000000   \n",
       "max         3.000000       3.000000       6.000000       6.000000   \n",
       "\n",
       "                  Q8             Q9  ...            Q28            Q29  \\\n",
       "count  200000.000000  200000.000000  ...  200000.000000  200000.000000   \n",
       "mean        0.294165       0.038810  ...       0.611760       4.034760   \n",
       "std         0.916137       0.267851  ...       0.984868       3.815101   \n",
       "min         0.000000       0.000000  ...       0.000000       0.000000   \n",
       "25%         0.000000       0.000000  ...       0.000000       2.000000   \n",
       "50%         0.000000       0.000000  ...       0.000000       3.000000   \n",
       "75%         0.000000       0.000000  ...       1.000000       5.000000   \n",
       "max         8.000000       4.000000  ...       4.000000      52.000000   \n",
       "\n",
       "                 Q30            Q31            Q32            Q33  \\\n",
       "count  200000.000000  200000.000000  200000.000000  200000.000000   \n",
       "mean        1.163025       9.428990       0.979740       0.032945   \n",
       "std         0.616417       9.004402       0.238725       0.802953   \n",
       "min         0.000000       0.000000       0.000000       0.000000   \n",
       "25%         1.000000       2.000000       1.000000       0.000000   \n",
       "50%         1.000000       6.000000       1.000000       0.000000   \n",
       "75%         1.000000      14.000000       1.000000       0.000000   \n",
       "max         3.000000      42.000000       3.000000      38.000000   \n",
       "\n",
       "                 Q34           Q35            Q36     Evaluation  \n",
       "count  200000.000000  200000.00000  200000.000000  200000.000000  \n",
       "mean        0.005520       0.64946       0.657345       0.158005  \n",
       "std         0.074092       0.47714       0.474598       0.364747  \n",
       "min         0.000000       0.00000       0.000000       0.000000  \n",
       "25%         0.000000       0.00000       0.000000       0.000000  \n",
       "50%         0.000000       1.00000       1.000000       0.000000  \n",
       "75%         0.000000       1.00000       1.000000       0.000000  \n",
       "max         1.000000       1.00000       1.000000       1.000000  \n",
       "\n",
       "[8 rows x 38 columns]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 数据没有出现残缺、多余的情况"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CaseId</th>\n",
       "      <th>Q1</th>\n",
       "      <th>Q2</th>\n",
       "      <th>Q3</th>\n",
       "      <th>Q4</th>\n",
       "      <th>Q5</th>\n",
       "      <th>Q6</th>\n",
       "      <th>Q7</th>\n",
       "      <th>Q8</th>\n",
       "      <th>Q9</th>\n",
       "      <th>...</th>\n",
       "      <th>Q27</th>\n",
       "      <th>Q28</th>\n",
       "      <th>Q29</th>\n",
       "      <th>Q30</th>\n",
       "      <th>Q31</th>\n",
       "      <th>Q32</th>\n",
       "      <th>Q33</th>\n",
       "      <th>Q34</th>\n",
       "      <th>Q35</th>\n",
       "      <th>Q36</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>200001</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>12</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>200002</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>8</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>200003</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>200004</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>200005</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "      <td>13</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 37 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   CaseId  Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  ...  Q27  Q28  Q29  Q30  Q31  \\\n",
       "0  200001   0   0   0   0   0   0   4   0   0  ...    0    0    2    1   12   \n",
       "1  200002   0   0   0   0   0   0   0   0   0  ...    0    2    5    0    8   \n",
       "2  200003   0   0   0   0   0   0   0   0   0  ...    0    0    3    1    3   \n",
       "3  200004   0   0   0   0   0   0   0   0   0  ...    0    1    3    1    1   \n",
       "4  200005   0   0   0   0   0   0   0   0   2  ...    0    1    6    1   13   \n",
       "\n",
       "   Q32  Q33  Q34  Q35  Q36  \n",
       "0    1    0    0    0    0  \n",
       "1    1    0    0    1    1  \n",
       "2    1    0    0    1    1  \n",
       "3    1    0    0    1    1  \n",
       "4    1    0    0    1    1  \n",
       "\n",
       "[5 rows x 37 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CaseId</th>\n",
       "      <th>Evaluation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>200001</td>\n",
       "      <td>0.314152</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>200002</td>\n",
       "      <td>0.314152</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>200003</td>\n",
       "      <td>0.314152</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>200004</td>\n",
       "      <td>0.314152</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>200005</td>\n",
       "      <td>0.314152</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   CaseId  Evaluation\n",
       "0  200001    0.314152\n",
       "1  200002    0.314152\n",
       "2  200003    0.314152\n",
       "3  200004    0.314152\n",
       "4  200005    0.314152"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "submit.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 数据处理得到测试集与训练集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "b=train.columns[1:-2]#测试集标签"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 120,
   "metadata": {},
   "outputs": [],
   "source": [
    "#利用sklearn的model_selection函数来进行数据划分，用百分之25的随机数据作为测试集\n",
    "target=train['Evaluation']\n",
    "x_train,x_test,y_train,y_test=model_selection.train_test_split(train[b],\n",
    "                                                               target,\n",
    "                                                               test_size=0.25,\n",
    "                                                               random_state=33)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.利用Logistic回归进行分类"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.linear_model import SGDClassifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 分别利用两个分类器进行分类"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "D:\\anaconda\\lib\\site-packages\\sklearn\\linear_model\\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n",
      "  FutureWarning)\n"
     ]
    }
   ],
   "source": [
    "lr=LogisticRegression()\n",
    "sgdc=SGDClassifier()\n",
    "lr.fit(x_train,y_train)#利用fit函数来训练模型\n",
    "lr_predict=lr.predict(x_test)\n",
    "sgdc.fit(x_train,y_train)#利用sdgc中的fit函数进行训练\n",
    "sdgc_predict=sgdc.predict(x_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 得到在测试集的准确率"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 123,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "逻辑回归分类准确率为：0.874940\n",
      "梯度下降分类准确率为：0.867220\n"
     ]
    }
   ],
   "source": [
    "lr_accuracy=lr.score(x_test,y_test)\n",
    "print('逻辑回归分类准确率为：%f'%lr_accuracy)\n",
    "from sklearn.metrics import classification_report\n",
    "sgdc_accuracy=sgdc.score(x_test,y_test)\n",
    "print('梯度下降分类准确率为：%f'%sgdc_accuracy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 因为逻辑回归的准确率更高，所以利用逻辑回归来预测测试集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(80000,)"
      ]
     },
     "execution_count": 124,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_predict=lr.predict(test[b])\n",
    "test_predict.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 125,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(80000,)\n"
     ]
    }
   ],
   "source": [
    "caseld=np.arange(20001,100001)\n",
    "print(caseld.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 126,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CaseId</th>\n",
       "      <th>Evaluation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>20001</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>20002</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>20003</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>20004</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>20005</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   CaseId  Evaluation\n",
       "0   20001           0\n",
       "1   20002           0\n",
       "2   20003           0\n",
       "3   20004           0\n",
       "4   20005           0"
      ]
     },
     "execution_count": 126,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lr_submit={\"CaseId\":caseld,'Evaluation':test_predict}\n",
    "lr_submit=pd.DataFrame(lr_submit)\n",
    "lr_submit.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 随机森林分类"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 127,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier#导入随机森林算法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 128,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
       "                       max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
       "                       min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "                       min_samples_leaf=1, min_samples_split=2,\n",
       "                       min_weight_fraction_leaf=0.0, n_estimators=100,\n",
       "                       n_jobs=None, oob_score=False, random_state=0, verbose=0,\n",
       "                       warm_start=False)"
      ]
     },
     "execution_count": 128,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cf = RandomForestClassifier(n_estimators=100, random_state=0)\n",
    "cf.fit(x_train, y_train)#利用fit函数进行模型训练"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 得到测试数据准确率"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 129,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "随机森林分类准确率为：0.911800\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "y_test_pred = cf.predict(x_test)\n",
    "cf_accuracy = accuracy_score(y_test, y_test_pred)\n",
    "print('随机森林分类准确率为：%f'%cf_accuracy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 随机森林得到的准确率更高"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 130,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CaseId</th>\n",
       "      <th>Evaluation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>20001</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>20002</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>20003</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>20004</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>20005</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   CaseId  Evaluation\n",
       "0   20001           0\n",
       "1   20002           0\n",
       "2   20003           0\n",
       "3   20004           0\n",
       "4   20005           0"
      ]
     },
     "execution_count": 130,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_pred = cf.predict(test[b])\n",
    "rf_submit={\"CaseId\":caseld,'Evaluation':y_pred}\n",
    "rf_submit=pd.DataFrame(rf_submit)\n",
    "rf_submit.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 对于两种结果进行比较："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 131,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "对比准确率为：0.912737\n"
     ]
    }
   ],
   "source": [
    "contrast_accuracy = accuracy_score(y_pred, test_predict)\n",
    "print(\"对比准确率为：%f\"%contrast_accuracy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 说明两种预测是十分接近的，但是逻辑回归明显预测没有随机森林算法得到的结果正确，所以有误差还是能接受的。总体说明两种分类的方法得到的结果都是比较可靠的。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 五、心得体会"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 虽然核心算法很难，但是掌握python自带分类器进行分类对于任何一个人来说，只要认真学习一下还是有所收获，在反复的实验中会逐渐掌握熟练。经过本次实验我发现对于实际问题的预测分析，往往只利用一种分类方法得到的结果是无法完全信任的，尤其是数据结构复杂的现实生活中的实际例子，不仅具有十分庞大的数据体量，数据的可用性、复杂性都是很高的。要想得到更好的预测结果就得从全局出发，将分类算法灵活使用，充分发挥不同分类方法的优点，扬长避短最后才能得到最优结果。这是一个优秀的机器学习工作者应该具备的能力。我目前的水平还十分有限，仅仅达到使用的能力，以后还要在底层原理和高层设计方面多多学习。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
